The answers to all the fill-in-the-blank sections can be found here.

library(tidyverse)
library(rvest)

rvest

We’ll use a new package called rvest and it’s part of the tidyuniverse bundle.

Load the library.

library(rvest)

Let’s pull reviews from Amazon for this digital voice recorder.

Let’s track them down.

Scroll down to near the bottom of the product page are about 8 listed reviews. At the bottom of that section is a link for See all reviews>. Let’s go see those reviews.

Note: Here’s a tip.

The URL to the reviews may look like (once you strip out all the ref tracking code:

https://www.amazon.com/48GB-Digital-Voice-Recorder-Dictaphone/product-reviews/B09XF2Z12G/

But if you try this link, it works just as well.

https://www.amazon.com/product-reviews/B09XF2Z12G/

Turns out you don’t need the product name, only the Amazon id.

Let’s request the data from the server.

We’ll use the read_html() function. It takes a URL as an argument.

Sub in the html of the digital voice recorder (second link) and the correct function name and run the code.

url <- "_____"

amazon <- _________(url)

amazon

Great job, we’ve imported an html page into R.

Now let’s extract what we want.

Parse HTML

There are a lot of data we could get but let’s focus on these elements:

We’ll need to find the HTML/CSS elements to target.

One way is to use inspect element but since we’ve already installed selector gadget we’ll use that instead.

Okay, scroll past the first two banner reviews and go to the reviews in the list.

Click on the selector gadget button in the browser and click on the stars.

A pop up box will show up on the bottom right and after you click on the first stars, .review-rating will appear. What you selected also is highlighted in green and similar elements on the web page that match that .review-rating will also be highlighted green and yellow.

However, when double checking, I see that the two banner reviews are included. I don’t want those! So I can click on the green of one of them and it will exclude that and selector gadget will try to guess another element. This time, #customer_review etc etc etc, which is way too long and too specific because it excluded the stars below the first star we wanted. So we click on the stars beneath the first one we wanted and we finally get all the stars we wanted. Look below to see how the path changes with each additional click to remove or add elements until we get the final one we want.

This time the final CSS class we want is #cm_cr-review_list .review-rating.

We need to pass that string into a new function: html_elements(). (An older version of rvest used to be html_nodes())

We already have the read html saved in the amazon object.

Use the new function and the class below.

amazon %>% 
  _________("__________________")

Alright, we’re closer.

What that did was returned all the nodes in the HTML that have this particular #cm_cr-review_list .review-rating class.

Now we need to extract the actual review number!

Let’s pipe in html_text() with no arguments needed.

ratings <- amazon %>% 
  _________("__________________") %>% 
  _________()

ratings
ratings <- amazon %>% 
  html_elements("#cm_cr-review_list .review-rating") %>% 
  html_text()

ratings

Great job!

Now we have an array with various star ratings.

We can go back and clean these later so we have just the raw number.

So, the read_html() and html_elements() and html_text() functions are all you’ll need for many common web scraping tasks.

There are others, though.

To check it all out, visit the rvest site.

Keep going!!

More scraping

In your browser, to exit the select gadget tool, you must click the “X” button on the far right of the pop up box.

Or click the ‘clear’ button if you want to select a new element.

Which is what you want to do!

Now, use the select gadget to find the path for the dates of the reviews (it may take 3 or 4 clicks of selector gadget to find just the right selection). We’ll save it to the object “dates”.

dates <- amazon %>% 
  html_elements("__________________") %>% 
  _________()

dates

Finally, get me the text of the reviews from customers and save it to the object “review_text”.

This may take 3-5 clicks on selector gadget to get just the right selection. Don’t include banner text reviews and don’t include text from other parts of the website.

review_text <- amazon %>% 
  _________("__________________") %>% 
  _________()

review_text

Great job. Let’s bring it all together in one pretty data frame.

some_reviews <- data.frame(ratings, dates, review_text)

glimpse(some_reviews)

That’s a lot of strings!

We want the numbers and dates!

Okay, let’s clean it up a bit with some stringr and lubridate functions.

some_reviews <- some_reviews %>% 
  mutate(
         # Need to get the ratings into a numeric format instead of sentence format
         ratings=str_replace(ratings, " out of 5 stars", ""),
         ratings=as.numeric(ratings),
         # Need to turn the dates into a date format instead of sentence format
         dates=str_replace(dates, ".* on ", ""),
         dates=mdy(dates),
         # Let's get rid of extra spaces and \ns that indicate line breaks in the html code
         review_text=str_trim(review_text))

glimpse(some_reviews)

Ah, much nicer.

Looped scraping

Alright, not bad!

We have 10 reviews so far.

But how do we get all the reviews?

Loops!